1 Abstract

This project is offered by Leanne Hyndman, a counsellor with Amber Community. It aims to provide useful information to Amber Community by analysing key counselling intake data alongside publicly available data, in order to give a clearer understanding of the organisation's work.

To achieve this aim, I cleaned the data from Amber Community, joined it with other data, carried out some exploratory analysis, performed a time series analysis, and compared the spatial distributions of population and referrals.

The data contains the following variables for each of the referrals received:

The key findings are:

  1. the number of referrals over time shows weak weekly and monthly seasonality, which may be due to the regular schedule of staff;
  2. the institution seems to have different levels of “attractiveness” to different regions: people who live in regional areas seem to be more likely to come to the institution;
  3. statewide and metropolitan COVID-19 lockdown policies had a significant influence on referrals to Amber Community.

The data is from three sources: Amber Community for the essential dataset “referrals.csv”, the R package “absmapsdata” for the map data, and the government’s website for demographics (address: https://www.coronavirus.vic.gov.au/victorian-coronavirus-covid-19-data).

2 Background and Motivation

Amber Community, formerly Road Trauma Support Services Victoria, is a not-for-profit organization contributing to the safety and well-being of road users.

They provide counselling and support to people affected by road trauma and address the attitudes and behaviours of road users through education. They deliver a range of education programs addressing the behaviours and attitudes of drivers to reduce the incidence of crashes, injuries and fatalities, and the associated trauma and grief.

They are now interested in a few questions and would like some data-driven support, so that they can gain more insight into their meaningful work and contribute more to road users’ mental and physical health.

3 Objectives and Significance

4 Data Exploration

The data is largely tidy but somewhat messy, with many missing values. With the help of my mentor Rob Hyndman, I removed the variables that were not needed, cleaned the variable names, and created the table referrals_clean1.

Then I checked the data type, name and values of each variable using the function glimpse():

## Rows: 8,037
## Columns: 8
## $ x1                <dbl> 393, 14, 16, 73, 318, 254, 257, 18, 122, 138, 173, 2…
## $ referral_id       <dbl> 402, 18, 20, 77, 326, 261, 264, 22, 128, 144, 179, 2…
## $ date_received     <date> 2016-06-18, 2016-07-01, 2016-07-01, 2016-07-03, 201…
## $ date_entered      <date> 2016-08-19, 2016-07-04, 2016-07-05, 2016-07-04, 201…
## $ client_type       <chr> "driver", "driver", "driver", "driver", "injured pas…
## $ referred_by       <chr> "VPeR", "VPeR", "VPeR", "VPeR", "VPeR", "VPeR", "VPe…
## $ we_suggest_you_do <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ postcode          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

After checking that, I renamed the variables, combined the values of the categorical variables that are identical at a certain level, and created a new table called referrals_clean2.

Next, I created a new variable “day_of_week” and made a table and a bar plot to check whether more referrals are received on Monday.

Table 4.1: day_of_week summary
day_of_week proportion (%)
Sun 11.011572
Mon 17.357223
Tue 15.553067
Wed 14.657210
Thu 14.955829
Fri 14.806520
Sat 10.265024
NA 1.393555

Figure 4.1: proportion of days of week

According to table 4.1 and figure 4.1, there are more referrals on Monday than on any other day of the week, which indicates possible weekly seasonality. This might be because some referrals made on weekends are postponed to the following working day, which is Monday. The same effect can also explain why Tuesday has the second-largest number of referrals. The rest of the weekdays (Wednesday, Thursday and Friday) have about the same number of referrals, which is still noticeably larger (by around 40%) than on a weekend day.
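As a rough illustration of how a “day_of_week” variable and its proportions can be derived, here is a minimal base R sketch. The dates below are illustrative only, and the object names are assumptions rather than the code used for this report.

```r
# Illustrative received dates; the real ones would come from
# referrals_clean2$date_received.
date_received <- as.Date(c("2016-06-18", "2016-07-01", "2016-07-01",
                           "2016-07-03", "2016-07-04", NA))

# as.POSIXlt()$wday runs from 0 (Sunday) to 6 (Saturday) and does not
# depend on the locale, unlike weekdays().
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
day_of_week <- factor(day_labels[as.POSIXlt(date_received)$wday + 1],
                      levels = day_labels)

# Percentage of referrals per day; useNA = "ifany" keeps missing dates
# as their own row, as in table 4.1.
counts <- table(day_of_week, useNA = "ifany")
proportion <- 100 * as.vector(counts) / sum(counts)
names(proportion) <- names(counts)
proportion
```

Keeping the missing dates as a separate category means the proportions still add up to 100.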

I had a look at the summary of the client_type variable:

Table 4.2: the summary of client type
client_type n
driver 2662
witness 2452
bereaved 1038
fam/fr of casualty 500
other injured person 428
passenger 428
unknown 272
rider 241
other 16

From table 4.2 above we can see that most clients are drivers, witnesses and bereaved people.

I also had a look at the summary of the referred_by variable:

Table 4.3: the number of referrals from different sources
referred_by n
VPeR 6876
Self 409
VSA 223
unknown 196
TAC 191
Other 55
Family/friend 37
Police 35
Victims of Crime 15

Table 4.3 clearly shows that VPeR is the major source of referrals, accounting for 6,876 of the 8,037 referrals (about 86%).
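The share of each source can be checked directly from the counts in table 4.3. The following is a small sketch in base R; the vector name is an assumption made for this example.

```r
# Counts copied from table 4.3.
referred_by <- c(VPeR = 6876, Self = 409, VSA = 223, unknown = 196, TAC = 191,
                 Other = 55, "Family/friend" = 37, Police = 35,
                 "Victims of Crime" = 15)

# Percentage share of each source, rounded to one decimal place.
share <- round(100 * referred_by / sum(referred_by), 1)
share
```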

To find out whether the patterns mentioned above hold over time, I did more analysis.

5 Time Series Analysis

Firstly, as mentioned before, figure 4.1 indicates that there is a weekly seasonal pattern in the data. Looking at whether this pattern persists or changes over time, we can conclude that it seems to be almost unchanged over the period.

Figure 5.1: number of referrals per week

We can see from figure 5.1 a cyclic behaviour from the start until 2020, and a significant drop in 2020, which might be the combined effect of the cyclic behaviour and the lockdown policy. The number of referrals then slowly returns to the cyclic pattern, which shows the significant influence of the lockdown policy.

Although the plot is a bit messy, we can see from figure 5.1 that the number of referrals is usually lower in winter than in summer. This is clearer in the monthly plot, figure 5.2:

Figure 5.2: number of referrals per month

The drop in the 2nd quarter may be because staff in the organization usually take a vacation then, due to the cold weather.
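The monthly counts behind plots like these can be built by labelling each referral with its year and month and then counting. The sketch below uses base R with made-up dates; it is only one possible way to do the aggregation and is not necessarily how the figures in this report were produced.

```r
# Made-up received dates; in the report they would come from
# referrals_clean2$date_received.
date_received <- as.Date(c("2019-03-04", "2019-03-18", "2019-05-02",
                           "2020-03-09", "2020-04-27", "2020-04-28"))

# Label each referral by year-month and by calendar month.
year_month <- format(date_received, "%Y-%m")
month_of_year <- factor(as.POSIXlt(date_received)$mon + 1, levels = 1:12,
                        labels = month.abb)

# Referrals per month (the series in a monthly time plot) ...
n_referrals <- table(year_month)
# ... and per calendar month within each year (the layout of a subseries plot).
by_month <- table(year = format(date_received, "%Y"), month = month_of_year)
n_referrals
by_month
```

Note that table() on year_month only lists months with at least one referral, so months with zero referrals would need to be filled in before plotting a regular series.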

Figure 5.3: subseries of monthly data

From figure 5.3 above we can see that in almost every month there is a significant drop in the year 2020, which may be a consequence of Australia’s lockdown policy for COVID-19.

This graph shows subseries of the referrals data by month. As shown above, the number of referrals peaks in March and October, and the troughs are in May and September. Most months show a downward trend from 2017 to 2022 but have troughs in 2020 or 2021, which may be a consequence of COVID-19. Overall, however, the seasonality is weak: the ETS function selects an ETS(M,N,N) model (multiplicative errors, no trend), which does not contain a seasonal component:

## Series: n_referrals 
## Model: ETS(M,N,N) 
##   Smoothing parameters:
##     alpha = 0.3389616 
## 
##   Initial states:
##      l[0]
##  134.8555
## 
##   sigma^2:  0.0609
## 
##      AIC     AICc      BIC 
## 755.0811 755.4503 761.7834
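
For reference, ETS(M,N,N) is simple exponential smoothing with multiplicative errors. In its standard state-space form (the notation here is assumed, not taken from the model output), with \(y_t\) the observed value of n_referrals, \(\ell_t\) the level and \(\varepsilon_t\) the relative error:

\[y_t = \ell_{t-1}(1+\varepsilon_t), \qquad \ell_t = \ell_{t-1}(1+\alpha\varepsilon_t)\]

Since \(\varepsilon_t = (y_t-\ell_{t-1})/\ell_{t-1}\), the level update can also be written as \(\ell_t = \ell_{t-1} + \alpha(y_t-\ell_{t-1})\). With the fitted \(\alpha \approx 0.339\) and \(\ell_0 \approx 134.86\), about a third of each one-step forecast error is carried into the level. Neither equation has a trend or seasonal term.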

I also made two interactive plots to show the changes in client types and sources of referrals over the period:

Figure 5.4: the number of referrals by referred_by

As shown above, VPeR has been the major source of referrals over the period, and referrals from VSA have disappeared since September 2019. This may be due to a classification error by the workers entering the data.

Figure 5.5: the number of referrals by client type

As shown above, the three most common client types over the period are witnesses, drivers and the bereaved.

Lastly, figure 5.6 shows how the gap between the date the referrals were received and the date the clients entered changes over the period of COVID-19:

Figure 5.6: the gap between the date the referrals were received and the date the clients entered

Although it is not very apparent, we can still observe a downward pattern followed by a flat, slightly upward pattern in this gap plot. Combined with the drop in referrals after the outbreak of COVID-19, this suggests that the number of referrals does, to some degree, influence the speed of processing referrals, which in turn indicates that the workload might be close to Amber Community's upper limit.
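To make the gap concrete, the sketch below computes it in days for the first four rows shown in the glimpse() output in section 4. The per-month median is an assumed way of summarising the gap over time and may differ from how figure 5.6 was built.

```r
# First four rows of date_received and date_entered from the glimpse() output.
date_received <- as.Date(c("2016-06-18", "2016-07-01", "2016-07-01",
                           "2016-07-03"))
date_entered <- as.Date(c("2016-08-19", "2016-07-04", "2016-07-05",
                          "2016-07-04"))

# Gap in whole days between the received date and the entered date.
gap_days <- as.numeric(date_entered - date_received)
gap_days

# Median gap per month of receipt (an assumed summary, for illustration).
median_gap <- tapply(gap_days, format(date_received, "%Y-%m"), median)
median_gap
```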

6 Spatial Data Analysis

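In this part, I joined three datasets: the referrals data from Amber Community, the map data from the absmapsdata package, and the demographic data from the government’s website.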

To compare the proportion of the population with the proportion of referrals in each postcode area, I defined a variable called “attractiveness”, which is computed by the formula below:

\[\frac{a_1}{a_2}-\frac{b_1}{b_2}\]

In this formula, \(a_1\) stands for the number of referrals in a certain postcode area, \(a_2\) stands for the number of referrals in the whole area, \(b_1\) stands for the population in a certain postcode area and \(b_2\) stands for the population in the whole area. So the larger the value is, the more “attractive” Amber Community is to a certain area.

The reason I didn’t use \(a_1/b_1\) is that it is not robust when there are no referrals. In some areas the number of referrals (\(a_1\)) is zero, so \(a_1/b_1\) is also zero, but the same “zero” can have different meanings because the size of the population in that area (\(b_1\)) can vary a lot.
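The small example below sketches the “attractiveness” calculation in base R. The area labels, referral counts and populations are made up for illustration and are not taken from the Amber Community data.

```r
# Made-up areas, referral counts and populations, for illustration only.
areas <- data.frame(
  postcode   = c("A", "B", "C", "D"),
  referrals  = c(30, 10, 0, 0),
  population = c(20000, 25000, 500, 54500)
)

# attractiveness = a1 / a2 - b1 / b2
areas$attractiveness <- areas$referrals / sum(areas$referrals) -
  areas$population / sum(areas$population)
areas
```

Areas C and D both have zero referrals, so \(a_1/b_1\) would be 0 for both, but their attractiveness values are -0.005 and -0.545 because D has a much larger population. Since both proportions sum to one over all areas, the attractiveness values always sum to zero.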

There are four variables in the interactive map: attractiveness, postcode, number of referrals, and population. I also plotted a red dot that represents the location of Amber Community. Figure 6.1 shows that Amber Community is more attractive to regional areas than it is to the metropolitan area in Victoria.

Figure 6.1: the map

Note: the data does have its limitations. There is not enough data to compute an accurate value of the true “attractiveness”, so in the interactive map in figure 6.1 I also included the number of referrals and the population of each area for reference.

7 Package citations

I used R version 4.2.1 (R Core Team 2022) and the following R packages: absmapsdata v. 1.3.3 (Mackey 2022), bookdown v. 0.27 (Xie 2016, 2022a), fpp3 v. 0.4.0 (Hyndman 2021), ggiraph v. 0.8.2 (Gohel and Skintzos 2022), ggpubr v. 0.4.0 (Kassambara 2020), grateful v. 0.1.11 (Rodríguez-Sánchez, Jackson, and Hutchins 2022), htmlwidgets v. 1.5.4 (Vaidyanathan et al. 2021), janitor v. 2.1.0 (Firke 2021), knitr v. 1.39 (Xie 2014, 2015, 2022b), leaflet v. 2.1.1 (Cheng, Karambelkar, and Xie 2022), plotly v. 4.10.0 (Sievert 2020), RColorBrewer v. 1.1.3 (Neuwirth 2022), rmarkdown v. 2.14 (Xie, Allaire, and Grolemund 2018; Xie, Dervieux, and Riederer 2020; Allaire et al. 2022), sf v. 1.0.7 (Pebesma 2018), tidyverse v. 1.3.1 (Wickham et al. 2019).

Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2022. Rmarkdown: Dynamic Documents for r. https://github.com/rstudio/rmarkdown.

Firke, Sam. 2021. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.

Hyndman, Rob. 2021. Fpp3: Data for “Forecasting: Principles and Practice” (3rd Edition). https://CRAN.R-project.org/package=fpp3.

Kassambara, Alboukadel. 2020. Ggpubr: ’Ggplot2’ Based Publication Ready Plots. https://CRAN.R-project.org/package=ggpubr.

Mackey, Will. 2022. Absmapsdata: A Catalogue of Ready-to-Use ASGS (and Other) Sf Objects.

Neuwirth, Erich. 2022. RColorBrewer: ColorBrewer Palettes. https://CRAN.R-project.org/package=RColorBrewer.

Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.

R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Rodríguez-Sánchez, Francisco, Connor P. Jackson, and Shaurita D. Hutchins. 2022. Grateful: Facilitate Citation of r Packages. https://github.com/Pakillo/grateful.

Sievert, Carson. 2020. Interactive Web-Based Data Visualization with r, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.

Vaidyanathan, Ramnath, Yihui Xie, JJ Allaire, Joe Cheng, Carson Sievert, and Kenton Russell. 2021. Htmlwidgets: HTML Widgets for r. https://CRAN.R-project.org/package=htmlwidgets.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.

———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.

———. 2016. Bookdown: Authoring Books and Technical Documents with R Markdown. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/bookdown.

———. 2022a. Bookdown: Authoring Books and Technical Documents with r Markdown. https://github.com/rstudio/bookdown.

———. 2022b. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.

Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown.

Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.